PPMuSA: PROSITE-Pattern Matcher Using Suffix Array

Authors

Aki Hasegawa

Akihiko Konagaya

Abstract

PPMuSA is a pattern matching program that matches PROSITE [2] signature patterns against an amino acid sequence database. To characterize a group of proteins, it is important to find occurrences of a signature in the amino acid sequences of those proteins. Such signatures are compiled in databases. PROSITE is one such database; its signatures distinguish members of a protein family or domain from unrelated proteins. In PROSITE, a signature is described in several forms. One of these forms is a “pattern”, a motif description based on a regular-expression-like syntax. For instance, a signature pattern is written as [AC]-x-V-x(4)-{ED} or <A-x-[ST](2)-x(0,1)-V. These patterns are interpreted as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp} and Nterm-Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val, respectively. ScanProsite [1] is a reference implementation of a PROSITE scanning tool that takes a pattern of this form as input and reports occurrences of the pattern in amino acid sequences. ScanProsite uses a sequential algorithm rather than an indexed one. PPMuSA uses prebuilt suffix array [3] indices on the amino acid sequences to achieve fast pattern matching. A suffix array is a data structure for information retrieval: an array of the starting indices of all suffixes of the text, sorted in lexicographic order of the suffixes. For string retrieval using a suffix array, the time complexity is O(m · log n), or O(m + log n) when an auxiliary data structure is used, where m is the length of the pattern string and n is the length of the text. PPMuSA takes a pattern of this form as input and reports all occurrences of the pattern in the amino acid sequence database.
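The pattern syntax described in the abstract maps closely onto ordinary regular expressions. The following is a minimal Python sketch of that mapping, assuming only the elements mentioned above (single residues, x, [..], {..}, repetition counts, and the < and > anchors). The function names prosite_to_regex and find_occurrences are hypothetical, and the scan is a plain sequential regex search, so it illustrates the pattern semantics only and not the suffix array method that PPMuSA uses.

```python
import re

# One pattern element: x, a residue letter, [..] or {..},
# optionally followed by a repetition count (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{%s}" % low if high is None else "{%s,%s}" % (low, high)
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_occurrences(pattern, sequence):
    """Return (0-based start, matched text) pairs, overlapping matches included."""
    # A lookahead lets the scan report matches that overlap each other.
    lookahead = "(?=(" + prosite_to_regex(pattern) + "))"
    return [(m.start(), m.group(1)) for m in re.finditer(lookahead, sequence)]


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(find_occurrences("[AC]-x-V-x(4)-{ED}", "MACVAAAAKE"))
```

The sequences used here are made-up test strings, not entries from any database.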


Similar resources

Efficient de novo assembly of large genomes using compressed data structures - Supplemental Materials and Methods

The suffix array is a compact representation of the lexicographic ordering of the suffixes of a text [1]. Each element of the array is an index into the original string; SA_X[i] = j indicates that the suffix starting at position j in T is the i-th lowest suffix in X. As an example consider the string T = AGATCGATA$. The suffix array of T is SA_T = [10, 9, 1, 7, 3, 5, 6, 2, 8, 4]. As the suffix a...
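The example ordering above can be reproduced with a short Python sketch. It assumes 1-based positions, as in the example, and a naive construction that sorts the suffixes directly, which is adequate only for short strings and is not the construction used by any of the papers listed here. The names build_suffix_array and search are hypothetical. The search uses two binary searches over the sorted suffixes, each comparing at most m characters per step, which gives the O(m · log n) lookup for a pattern of length m in a text of length n.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes in sorted order."""
    # Naive construction: sort suffix start offsets by the suffix text itself.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]


def search(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        start = sa[rank] - 1
        return text[start:start + m]

    # First rank whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix ranked in [first, lo) starts with the pattern.
    return sorted(sa[first:lo])


text = "AGATCGATA$"
sa = build_suffix_array(text)
print(sa)
print(search(text, sa, "GAT"))
```

The sentinel $ sorts before the letters A, C, G and T in ASCII, so ordinary string comparison gives the same order as the example.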


Fast Motif Search in Protein Sequence Databases

Regular expression pattern matching is widely used in computational biology. Searching through a database of sequences for a motif (a simple regular expression), or its variations is an important interactive process which requires fast motif-matching algorithms. In this paper, we explore and evaluate various representations of the database of sequences using suffix trees for two types of query ...


On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching

We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with p processors. Given a static text of length n, we first show how to compute the suffix array interval of a given pattern of length m in O(m/p + lg p + lg lg p · lg lg n) time for p ≤ m. For approximate pattern matching with k differences or mismatches, we show how to compute all...


Bottom-k document retrieval

We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not easy to extend. We design a new solution for this ...


Improved Processing of Path Query on RDF Data Using Suffix Array

RDF is a recommended standard to describe additional semantic information to resources on the Semantic Web. Matono et al. proposed an indexing and query processing scheme for path-based RDF query using a suffix array. In this paper, we indicate some points on the previous approach. We propose an improved indexing and query processing scheme to reduce the binary search space and the overhead cau...




Publication date: 2005
